By the end of the lab, you will be able to …
Download and open code-along-03.qmd
Load the standard packages.
Most tasks related to data analysis are not glorious or fancy.
Much of the work goes into getting a dataset into the shape needed for analysis.
This task goes by different names: “data cleaning,” “data management,” “data manipulation,” “data wrangling,” and “data transformation.”
dplyr package
The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges such as:
|>
The pipe operator passes what comes before it into the function that comes after it, as the first argument of that function.
dplyr style
In data transformation pipelines, always use a
|> |>
We’ll talk about data visualization pipes later…
Heads Up!
|> (the native pipe operator) and %>% (from the magrittr package) behave identically for simple cases.
function(argument)
Functions are (most often) verbs, followed by what they will be applied to in parentheses.
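A small sketch of the pattern, using a made-up numeric vector (the values are hypothetical): the function name is the verb, and the argument in parentheses is what the verb acts on. The piped form should give the same result, because |> supplies the left-hand side as the first argument of the next function.

```r
# Hypothetical values, for illustration only
ages <- c(19, 24, 31, 27)

# verb(object): take the mean of ages, then round it
nested <- round(mean(ages), 1)

# The same computation written as a pipeline, read left to right
piped <- ages |> mean() |> round(1)

# Both forms produce the same value
identical(nested, piped)
```

The nested form is read inside out, while the piped form is read in the order the steps happen.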
dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges.
dplyr basics
They are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The verbs all have in common:
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
select(), filter(), and drop_na()
Use select() to pick specific columns from your dataset.
Use filter() to keep rows that meet a condition.
Use drop_na() to remove rows with missing (NA) values.
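The three verbs can be chained in one pipeline. The sketch below uses a hypothetical toy tibble whose column names mimic the GSS variables used in this lab; the values are made up.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not real GSS records
toy <- tibble(
  year     = c(2018, 2022, 2022, 2022),
  agekdbrn = c(25, 17, NA, 31),
  sex      = c(1, 2, 2, 1)
)

kept <- toy |>
  select(year, agekdbrn) |>   # keep only these two columns
  filter(year == 2022) |>     # keep only rows from 2022
  drop_na(agekdbrn)           # drop rows where agekdbrn is NA

kept
```

Two rows survive: the 2018 row is removed by filter() and the row with a missing agekdbrn is removed by drop_na().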
group_by() and summarize()
Use group_by() to organize your data into groups based on one or more variables.
Use summarize() to compute statistics like total, mean, or median for each group.
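As a rough illustration, the sketch below groups a hypothetical toy tibble by sex and computes one row of statistics per group. The column names echo the GSS variables, but the values are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not real GSS records
toy <- tibble(
  sex  = c("male", "male", "female", "female", "female"),
  hrs1 = c(40, 50, 30, 40, 20)
)

by_sex <- toy |>
  group_by(sex) |>            # one group per value of sex
  summarize(
    count  = n(),             # number of rows in the group
    mean   = mean(hrs1),      # group mean
    median = median(hrs1)     # group median
  )

by_sex
```

summarize() collapses each group to a single row, so the result has one row per value of sex.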
gss_all data frame:
# A tibble: 2 × 2
sex freq
<dbl+lbl> <int>
1 1 [male] 1031
2 2 [female] 1363
dplyr in action
Compare the average and median age at first childbirth for U.S. men and women in 2022.
mutate() in action
Use mutate() to add new columns or change existing ones.
What proportion of new parents were teenagers (i.e., under 18 years old)?
gss_all |>
  select(year, agekdbrn) |>
  filter(year == 2022) |>
  drop_na(agekdbrn) |>                          # without this step, summarise() will report NA
  mutate(teen_parent = (agekdbrn < 18) * 1) |>  # multiplying by 1 turns TRUE/FALSE into 1s and 0s
  summarise(proportion = mean(teen_parent))
# A tibble: 1 × 1
proportion
<dbl>
1 0.0773
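The proportion comes from taking the mean of a 0/1 indicator: the sum of the 1s counts the cases that meet the condition, and dividing by the number of cases gives their share. A minimal sketch with made-up ages (hypothetical values, not GSS data):

```r
# Hypothetical ages at first birth
agekdbrn <- c(17, 25, 31, 16, 22)

# TRUE/FALSE multiplied by 1 becomes 1/0
teen_parent <- (agekdbrn < 18) * 1

n_teen <- sum(teen_parent)    # how many are under 18
prop   <- mean(teen_parent)   # share under 18: n_teen / length(agekdbrn)
```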
mutate() in action
Use case_when() inside mutate() to create values based on conditions.
What proportion of new parents had their first child before age 18, at ages 18–29, at ages 30–39, or at age 40 or older?
              Freq        %   % Cum.
----------- ------ -------- --------
<18            186     7.73     7.73
18–29         1704    70.82    78.55
30–39          463    19.24    97.80
40+             53     2.20   100.00
Total         2406   100.00   100.00
Heads Up!
Overwriting datasets and variables can be intentional or unintentional.
Let’s make a tiny data frame to use as an example:
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something and show me
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
Do something, save result, overwriting original
# A tibble: 5 × 2
x y
<dbl> <chr>
1 2 a
2 4 a
3 6 b
4 8 c
5 10 c
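The code for these scenarios is not shown in this section, so the following is only a sketch. It assumes the tiny data frame has x running from 1 to 5 with the y values printed above, and that the "something" is doubling x. The name df_doubled is hypothetical.

```r
library(dplyr)
library(tibble)

# Assumed starting data: x is 1 to 5, y matches the printed output
df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))

# Do something and show me: the result is printed, df itself is unchanged
df |> mutate(x = x * 2)
x_after_show <- df$x            # still 1, 2, 3, 4, 5

# Do something, save result, not overwriting original
df_doubled <- df |> mutate(x = x * 2)
x_after_copy <- df$x            # still 1, 2, 3, 4, 5

# Do something, save result, overwriting original
df <- df |> mutate(x = x * 2)
x_after_overwrite <- df$x       # now 2, 4, 6, 8, 10
```

Only the assignment back to df changes the original; a pipeline without an assignment just prints its result.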
Do something, save result, overwriting original when you shouldn’t
Do something, save result, overwriting original data frame
Do something, save result, not overwriting original.
Do something and show me
gss_all |>
  select(year, agekdbrn) |>
  filter(year == 2022) |>
  drop_na(agekdbrn) |>
  mutate(age_groups = case_when(
    agekdbrn < 18                   ~ "<18",
    agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
    agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
    agekdbrn >= 40                  ~ "40+",
    TRUE                            ~ NA_character_)) |>
  group_by(age_groups) |>
  summarise(count = n()) |>
  # summarise() returns one row per group, so divide by the overall total here;
  # inside summarise(), sum(count) would equal count and every proportion would be 1
  mutate(proportion = round(count / sum(count), 3))
Let’s use dplyr grammar to find the median and mode for the childs variable.
gss_all$childs <- zap_missing(gss_all$childs)
gss_all$childs <- as_factor(gss_all$childs)
gss_all$childs <- droplevels(gss_all$childs)
gss_all |>                             # dplyr grammar, starting with the name of the df and a pipe
  filter(year == 2024) |>
  freq(childs, report.nas = FALSE) |>  # freq() function as usual
  tb()                                 # tb() function to turn the table into a tibble
# A tibble: 9 × 4
childs freq pct pct_cum
<fct> <dbl> <dbl> <dbl>
1 0 1029 31.4 31.4
2 1 484 14.8 46.2
3 2 851 26.0 72.1
4 3 475 14.5 86.6
5 4 243 7.41 94.0
6 5 96 2.93 96.9
7 6 53 1.62 98.6
8 7 16 0.488 99.1
9 8 or more 31 0.946 100
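One way to read the mode and median off a frequency table like this one is sketched below. It re-enters the frequencies printed above by hand into a tibble (childs_tbl is a hypothetical name). The mode is the category with the highest frequency, and the median is taken as the first category whose cumulative percentage reaches 50, which works here because the categories are ordered and no category ends at exactly 50%.

```r
library(dplyr)
library(tibble)

# Frequencies copied from the printed table
childs_tbl <- tibble(
  childs = c("0", "1", "2", "3", "4", "5", "6", "7", "8 or more"),
  freq   = c(1029, 484, 851, 475, 243, 96, 53, 16, 31)
) |>
  mutate(pct_cum = 100 * cumsum(freq) / sum(freq))

# Mode: the category with the largest frequency
mode_childs <- childs_tbl |>
  slice_max(freq, n = 1) |>
  pull(childs)

# Median: the first category where the cumulative percent reaches 50
median_childs <- childs_tbl |>
  filter(pct_cum >= 50) |>
  slice(1) |>
  pull(childs)
```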
na.rm is a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.
[1] 40
[1] 41.11279
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.00 40.00 41.11 48.00 89.00 32371
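The effect of na.rm can be seen on a small made-up vector (hypothetical values, not GSS data). Without na.rm = TRUE, a single NA makes the mean NA.

```r
# Hypothetical hours worked, with one missing value
hours <- c(40, 35, NA, 50)

mean_with_na <- mean(hours)                    # NA, because one value is missing
mean_no_na   <- mean(hours, na.rm = TRUE)      # mean of 40, 35, and 50
median_no_na <- median(hours, na.rm = TRUE)    # median of 40, 35, and 50
```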
summarize()
# A tibble: 2 × 4
`as_factor(sex)` count mean median
<fct> <int> <dbl> <dbl>
1 male 869 41.7 40
2 female 891 37.3 40
descr()
Univariate statistics for numerical data
Descriptive Statistics
gss_all$hrs1
Label: Number of hours worked last week
N: 75699
                        hrs1
----------------- ----------
             Mean      41.11
          Std.Dev      14.12
              Min       0.00
               Q1      37.00
           Median      40.00
               Q3      48.00
              Max      89.00
              MAD       7.41
              IQR      11.00
               CV       0.34
         Skewness       0.18
      SE.Skewness       0.01
         Kurtosis       1.42
          N.Valid   43328.00
                N   75699.00
        Pct.Valid      57.24
descr()
gss_all |>
  filter(year == 2024) |>
  group_by(as_factor(sex)) |>
  drop_na(hrs1, sex) |>
  descr(hrs1,
        stats = "common") |>
  tb()
Error in if (grepl("^get\\(", deparse(call[[.p$var]]))) {: the condition has length > 1
# A tibble: 2 × 10
`as_factor(sex)` variable mean sd min med max n.valid n
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 male V1 41.7 13.7 0 40 89 869 869
2 female V1 37.3 13.7 0 40 89 891 891
# ℹ 1 more variable: pct.valid <dbl>
Political polarization is high in the U.S. today and attitudes about gender and family behavior have been heavily debated.
Using the most recent survey, do more liberals than conservatives think sex before marriage is ‘not wrong at all’?
How do we find out?
Let’s familiarize ourselves with the premarsx and polviews variables.
In the console, type ?premarsx and hit enter. The Help pane will show you the question text, response options and values.
Now, do the same for polviews.
gss_all <- gss_all |>
  mutate(
    # collapse the 7-point polviews scale into three categories
    pol3cat = case_when(
      polviews >= 1 & polviews <= 3 ~ "Liberal",
      polviews == 4                 ~ "Moderate",
      polviews >= 5 & polviews <= 7 ~ "Conservative",
      TRUE                          ~ NA_character_),
    # store as a factor so the categories print in this order
    pol3cat = factor(pol3cat,
                     levels = c("Liberal", "Moderate", "Conservative"))
  )
polviews
What’s your conclusion to our initial research question?
% who think sex relations before marriage is __________, by political views
Cross-Tabulation, Column Proportions
premarsx * sex
Data Frame: gss_all
---------- ----- ---------------- ---------------- ----------------
             sex                1                2            Total
  premarsx
         1          4159 ( 20.6%)    7116 ( 28.0%)   11275 ( 24.7%)
         2          1499 (  7.4%)    2388 (  9.4%)    3887 (  8.5%)
         3          3904 ( 19.3%)    4792 ( 18.9%)    8696 ( 19.1%)
         4         10672 ( 52.7%)   11086 ( 43.7%)   21758 ( 47.7%)
     Total         20234 (100.0%)   25382 (100.0%)   45616 (100.0%)
---------- ----- ---------------- ---------------- ----------------
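Column proportions like these can also be worked out with dplyr verbs. The sketch below re-enters the cell counts printed above by hand (counts and col_pct are hypothetical names) and divides each count by its column total, meaning the total for that value of sex.

```r
library(dplyr)
library(tibble)

# Cell counts copied from the printed cross-tabulation
counts <- tibble(
  premarsx = rep(1:4, times = 2),
  sex      = rep(c(1, 2), each = 4),
  n        = c(4159, 1499, 3904, 10672,   # sex = 1
               7116, 2388, 4792, 11086)   # sex = 2
)

col_pct <- counts |>
  group_by(sex) |>                              # one group per column of the table
  mutate(pct = round(100 * n / sum(n), 1)) |>   # share of the column total
  ungroup()

col_pct
```

Because the grouping variable is sex, the percentages within each sex sum to 100, matching the "Column Proportions" layout. The same pattern applies with pol3cat in place of sex.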